
fix: recovery from crash during optimization#246

Open
zhourrr wants to merge 5 commits into main from feat/more_ut

Conversation


@zhourrr zhourrr commented Mar 20, 2026

Greptile Summary

This PR adds a crash recovery test for the Optimize operation, introduces a new collection_optimizer helper binary, and bundles several housekeeping fixes: typo corrections (CreateComapctTask → CreateCompactTask, new_segmnet_meta → new_segment_meta, etc.), a parameter rename for clarity (create → create_if_missing), and a relaxation of the floating-point comparison tolerance in Doc::operator==.

Key changes:

  • New collection_optimizer binary that opens a collection and calls Optimize(), used as a subprocess that can be killed to simulate a crash.
  • New OptimizeRecoveryTest::CrashDuringOptimize test that inserts 100 000 documents, crashes the optimizer mid-run, and then verifies the collection can be re-opened and data is intact.
  • Rename of all six CreateComapctTask call-sites (source + tests) to CreateCompactTask.
  • Typo fixes in parameter names and log messages across segment.cc, segment.h, and index_storage.h.
  • Float/double comparison epsilon in doc.cc loosened (1e-6f → 1e-4f for float, 1e-9 → 1e-6 for double).

Issues found:

  • The final data-integrity verification block in CrashDuringOptimize (after the second RunOptimizer call) calls collection->Insert() instead of collection->Fetch() + compare — it inserts documents instead of verifying them, so data corruption would go undetected.
  • write_recovery_test.cc SetUp/TearDown still reference the old crash_test_db path after the variable was renamed to write_recovery_test_db, leaving the test directory uncleaned between runs.
  • collection_optimizer.cc opens the collection with CollectionOptions{false, true} (no segment limit), while the test creates/reopens it with CollectionOptions{false, true, 256 * 1024}, causing a configuration mismatch that may affect whether compaction is triggered.

Confidence Score: 2/5

  • Not safe to merge: the new crash-recovery test has a copy-paste bug that makes the final data-integrity verification a no-op, and the write recovery test has stale cleanup paths.
  • Three P1 bugs in the new test files: (1) the final verification loop inserts documents instead of fetching and comparing them — the test's core correctness guarantee is missing; (2) SetUp/TearDown in write_recovery_test.cc delete the wrong directory, causing test pollution; (3) the collection_optimizer binary uses different CollectionOptions than the test, potentially preventing the optimizer from exercising the intended code paths. The production-code changes (typo renames, epsilon relaxation) are low-risk, but the test infrastructure issues are serious enough to block merging.
  • tests/db/crash_recovery/optimize_recovery_test.cc (final verification loop), tests/db/crash_recovery/write_recovery_test.cc (SetUp/TearDown paths), tests/db/crash_recovery/collection_optimizer.cc (CollectionOptions mismatch)

Important Files Changed

Filename Overview
tests/db/crash_recovery/optimize_recovery_test.cc New crash-recovery test for optimize; has a critical copy-paste bug where the final data-integrity loop calls Insert() instead of Fetch(), a flaky WIFSIGNALED assertion, and commented-out dead code.
tests/db/crash_recovery/write_recovery_test.cc Renamed collection/dir identifiers but SetUp and TearDown still reference the old "crash_test_db" path, leaving write_recovery_test_db uncleaned between runs.
tests/db/crash_recovery/collection_optimizer.cc New optimizer helper binary; opens collection with CollectionOptions{false, true} missing the 256*1024 segment limit used by the test, and is missing std::endl on the final stats output line.
tests/db/crash_recovery/CMakeLists.txt Extracts common library list into CRASH_RECOVERY_COMMON_LIBS variable and adds collection_optimizer build target and dependency; clean refactor.
src/db/collection.cc Fixes three call-sites of the renamed CreateComapctTask → CreateCompactTask; straightforward typo correction.
src/db/index/segment/segment_helper.h Renames CreateComapctTask to CreateCompactTask; simple typo fix with all callers updated.
src/db/index/common/doc.cc Loosens float comparison tolerance from 1e-6f to 1e-4f and double from 1e-9 to 1e-6; may be needed for round-trip precision but is a meaningful behavioral change to Doc equality.

Sequence Diagram

sequenceDiagram
    participant T as Test
    participant C as Collection in-process
    participant O as collection_optimizer subprocess
    participant D as On-disk Collection

    T->>C: CreateAndOpen(dir_path_)
    T->>C: Insert 100k docs
    C->>D: Persist segments
    T->>C: collection.reset()

    T->>O: fork + execvp RunOptimizerAndCrash 4s
    O->>D: Open with CollectionOptions{false,true}
    O->>D: Optimize compaction starts
    T-->>O: SIGKILL after 4s

    T->>C: Open for recovery
    T->>C: Fetch all 100k docs and verify integrity
    T->>C: Insert 50k more docs
    T->>C: collection.reset()

    T->>O: fork + execvp RunOptimizer no crash
    O->>D: Open and Optimize completes cleanly
    O-->>T: exit 0

    T->>C: Open final
    T->>C: Stats assert 150k docs
    T->>C: BUG Insert 150k docs instead of Fetch and verify

Comments Outside Diff (2)

  1. tests/db/crash_recovery/write_recovery_test.cc, line 129 (link)

    P1 SetUp/TearDown reference stale path after rename

    dir_path_ was renamed from "crash_test_db" to "write_recovery_test_db" in this PR, but both SetUp (line 129) and TearDown (line 134) still attempt to delete the old "./crash_test_db" path. As a result:

    • The actual test directory write_recovery_test_db is never cleaned up between test runs, causing test pollution and potential false-positives on repeated runs.
    • The delete of crash_test_db is a no-op since that directory no longer exists.
  2. tests/db/crash_recovery/write_recovery_test.cc, line 134 (link)

    P1 Stale path in TearDown after rename

    Same rename miss as in SetUp. The TearDown must also reference the updated directory name.

Last reviewed commit: "fix: fix some typos"

Greptile also left 5 inline comments on this PR.

Comment on lines +185 to +195
for (int batch = 0; batch < (num_batches + 1000); batch++) {
  std::vector<Doc> docs;
  for (int i = 0; i < batch_size; i++) {
    docs.push_back(CreateTestDoc(batch * batch_size + i, 0));
  }
  auto write_result = collection->Insert(docs);
  ASSERT_TRUE(write_result);
  for (auto &s : write_result.value()) {
    ASSERT_TRUE(s.ok());
  }
}

P1 Final verification loop inserts instead of verifying

The comment on line 178 says "verify data integrity", but the loop body calls collection->Insert() instead of collection->Fetch() + comparing results. This is a copy-paste error from the data population block above. The test will modify the collection instead of validating it, and any data-integrity bug would go undetected here.

Compare this to the correctly-written verification block at lines 142–157, which uses collection->Fetch(pks) and then checks *actual_doc == expected_doc.

Suggested change
for (int batch = 0; batch < (num_batches + 1000); batch++) {
  std::vector<Doc> docs;
  for (int i = 0; i < batch_size; i++) {
    docs.push_back(CreateTestDoc(batch * batch_size + i, 0));
  }
  auto write_result = collection->Insert(docs);
  ASSERT_TRUE(write_result);
  for (auto &s : write_result.value()) {
    ASSERT_TRUE(s.ok());
  }
}
for (int batch = 0; batch < (num_batches + 1000); batch++) {
  for (int i = 0; i < batch_size; i++) {
    Doc expected_doc = CreateTestDoc(batch * batch_size + i, 0);
    std::vector<std::string> pks{expected_doc.pk()};
    if (auto res = collection->Fetch(pks); res) {
      auto map = res.value();
      if (map.find(expected_doc.pk()) == map.end()) {
        FAIL() << "Returned map does not contain doc[" << expected_doc.pk() << "]";
      }
      const auto actual_doc = map.at(expected_doc.pk());
      ASSERT_EQ(*actual_doc, expected_doc)
          << "Data mismatch for doc[" << expected_doc.pk() << "]";
    } else {
      FAIL() << "Failed to fetch doc[" << expected_doc.pk() << "]";
    }
  }
}

Comment on lines +80 to +81
auto result =
zvec::Collection::Open(config.path, zvec::CollectionOptions{false, true});

P1 CollectionOptions mismatch with test configuration

The test file optimize_recovery_test.cc creates and reopens the collection with zvec::CollectionOptions{false, true, 256 * 1024} (a max-doc-per-segment limit of 256 * 1024), but the collection_optimizer binary opens the same on-disk collection with only zvec::CollectionOptions{false, true} (no segment size limit). This mismatch means the optimizer may behave differently than expected during the crash recovery test — for example, it might not trigger segment compaction at the same thresholds, leading to a test that doesn't actually exercise the intended code paths.

Suggested change
auto result =
zvec::Collection::Open(config.path, zvec::CollectionOptions{false, true});
zvec::Collection::Open(config.path, zvec::CollectionOptions{false, true, 256 * 1024});

Comment on lines +91 to +92
ASSERT_TRUE(WIFSIGNALED(status))
<< "Child process was not killed by a signal. It exited normally?";

P2 Flaky ASSERT_TRUE(WIFSIGNALED) when process exits before kill

If the optimizer finishes naturally within the 4-second window, kill(pid, 0) will return -1 (process gone), the kill is skipped, and waitpid will collect a normal-exit status — causing ASSERT_TRUE(WIFSIGNALED(status)) to fail with "Child process was not killed by a signal." The same pattern already exists in write_recovery_test.cc (RunGeneratorAndCrash), but the concern is amplified here because on faster machines the optimize of 2000 * 50 = 100 000 documents might complete quickly.

Consider using a softer assertion (e.g., accept either WIFEXITED or WIFSIGNALED) or asserting that the remaining verification logic still holds regardless of which exit path occurred.

if (s.ok()) {
  std::cout << "Optimize completed successfully" << std::endl;
  // Print final stats
  std::cout << "Final stats: " << collection->Stats()->to_string_formatted();

P2 Missing newline after final stats output

The final stats line has no std::endl or "\n", so the output may be buffered and mixed with subsequent output (e.g., the process's exit trace) without a line break. All other output statements in this file include std::endl.

Suggested change
std::cout << "Final stats: " << collection->Stats()->to_string_formatted();
std::cout << "Final stats: " << collection->Stats()->to_string_formatted() << std::endl;

Comment on lines +197 to +205
// VectorQuery query;
// query.output_fields_ = {"name", "age"};
// query.filter_ = "invert_id >= 6000 and id < 6080";
// query.topk_ = 100;
// std::vector<float> feature(4, 0.0);
// query.query_vector_.assign((const char *)feature.data(),
// feature.size() * sizeof(float));
// query.field_name_ = "dense";
// collection->Query();

P2 Commented-out dead code should be removed

Leaving a commented-out VectorQuery block in the test adds noise without providing value. If this functionality is intended to be tested, it should be enabled; otherwise, it should be deleted.


@zhourrr zhourrr changed the title feat: add crash recovery test for optimize fix: recovery from crash during optimization Mar 20, 2026